Explaining Neural Models by Interpretable Sample-Based Explanations

ABSTRACT

Sample-based model explanation techniques are provided using arbitrary spans of training data at any granularity as an explanation with increased interpretability. In one aspect, a method for explaining a machine learning model {circumflex over (θ)} includes: training the machine learning model {circumflex over (θ)} with training data D; obtaining a decision of the machine learning model {circumflex over (θ)}; masking one or more datapoints in the training data D; determining whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking is the same as the decision of the machine learning model {circumflex over (θ)} obtained prior to the masking; and using the masking to explain which of the one or more datapoints in the training data D are significant. Namely, the one or more datapoints in the training data D that, when masked, change the decision of the machine learning model {circumflex over (θ)} are significant.

FIELD OF THE INVENTION

The present invention relates to explaining model behavior, and more particularly, to sample-based model explanation techniques using arbitrary spans of training data at any granularity as an explanation with increased interpretability.

BACKGROUND OF THE INVENTION

As complex natural language processing models become an indispensable tool in many applications, there is growing interest in explaining the working mechanisms of these models. However, given recent advances in natural language processing, the scale of state-of-the-art models and datasets is usually extensive, which challenges the application of sample-based model explanation methods in many respects, such as explanation interpretability. The interpretability of an explanation refers to how easy it is for humans to understand the explanation and how it arrives at its outcome.

Among the vast number of existing techniques for explaining machine learning models, Influence Functions (IF) that use training examples as explanations for model decisions (i.e., sample-based model explainability methods) have recently gained popularity in natural language processing. See, for example, Koh and Liang, “Understanding Black-box Predictions via Influence Functions,” Proceedings of the 34^(th) International Conference on Machine Learning, Sydney, Australia, (10 pages) (August 2017) (hereinafter “Koh and Liang”). Influence Functions have been applied to explain Bidirectional Encoder Representations from Transformers (BERT)-based text classification and natural language inference models, as well as to aid text generation for data augmentation. See, for example, Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805v2 (May 2019) (16 pages) (hereinafter “Devlin”) and Radford et al., “Language Models are Unsupervised Multitask Learners,” 2019 (24 pages).

However, while useful, Influence Functions may not be entirely sufficient for natural language processing applications. For instance, the majority of existing works use entire training instances as explanations. However, for long natural language texts that are common in many high-impact application domains such as healthcare or finance, it may be difficult, if not impossible, to comprehend an entire instance as an explanation. For example, a model decision may depend only on a specific part of a long training instance.

Further, the evaluation of training-instance-based explanation methods remains an open question. Previous evaluation approaches involve either an over-simplified assumption on the agreement of labels between training and test instances, or indirect or manual inspection. A process that automatically measures the semantic relations at scale, and that correlates highly with human judgment, is still needed.

Therefore, sample-based model explanation techniques that effectively use any appropriate granularity of spans of training data as the explanation unit providing enhanced interpretability would be desirable.

SUMMARY OF THE INVENTION

The present invention provides sample-based model explanation techniques using arbitrary spans of training data at any granularity as an explanation with increased interpretability. In one aspect of the invention, a method for explaining a machine learning model {circumflex over (θ)} (e.g., for natural language processing) is provided. The method includes: training the machine learning model {circumflex over (θ)} with training data D; obtaining a decision of the machine learning model {circumflex over (θ)}; masking one or more datapoints in the training data D; determining whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking is the same as the decision of the machine learning model {circumflex over (θ)} obtained prior to the masking; and using the masking to explain which of the one or more datapoints in the training data D are significant. Namely, the one or more datapoints in the training data D that, when masked, change the decision of the machine learning model {circumflex over (θ)} are significant.

An arbitrary span of the training data can be employed for the masking. For instance, a training span x_(ij) can be identified in a training example z=(x, y) from the training data D, wherein x is a training sequence and y is the decision of the machine learning model {circumflex over (θ)}. The training span x_(ij) can be masked to provide z_(−ij)=(x_(−ij), y) as the masking, wherein z_(−ij) is the training data corresponding to the training span x_(ij), and x_(−ij) is a sequence in which training span x_(ij) is masked. An importance of the training span x_(ij) on the training example z=(x, y) can then be determined using the masking. By way of example only, the importance of the training span x_(ij) can be determined using a loss gradient.

An influence of a training example z on a test example z′ can also be determined. For instance, an importance ∇imp(x′_(kl)|z′;{circumflex over (θ)}) of a test span x′_(kl) on the test example z′ can be determined. An importance ∇imp(x_(ij)|z;{circumflex over (θ)}) of a training span x_(ij) on the training example z can be determined. The influence of the training example z on the test example z′ is then determined using ∇imp(x′_(kl)|z′;{circumflex over (θ)}) and ∇imp(x_(ij)|z;{circumflex over (θ)}). An evaluation of whether the training span x_(ij) of the training example z is semantically related to the test span x′_(kl) of the test example z′ can also be made.

In another aspect of the invention, another method for explaining a machine learning model {circumflex over (θ)} is provided. The method includes: identifying a training span x_(ij) in a training example z=(x, y) from the training data D used to train the machine learning model {circumflex over (θ)}, wherein x is a training sequence and y is the decision of the machine learning model {circumflex over (θ)}; masking the training span x_(ij) to provide z_(−ij)=(x_(−ij), y), wherein z_(−ij) is the training data corresponding to the training span x_(ij), and x_(−ij) is a sequence in which training span x_(ij) is masked; and determining an importance of the training span x_(ij) on the training example z=(x, y) using the masking. An influence of the training span x_(ij) on the machine learning model {circumflex over (θ)} can also be determined.

A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram illustrating an exemplary neural network according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an exemplary methodology for explaining a machine learning model {circumflex over (θ)} according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an exemplary methodology for determining the importance of a training span x_(ij) on a test example z′ according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an exemplary methodology for determining the influence of the training span x_(ij) to a test example z′ according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating an exemplary methodology for evaluating whether the training span x_(ij) of z is semantically related to the test span x′_(kl) of z′ according to an embodiment of the present invention;

FIG. 6 is a table illustrating data statistics for a first dataset (Dataset 1) and a second dataset (Dataset 2) used in accordance with the present techniques according to an embodiment of the present invention;

FIG. 7 is a table illustrating semi-automatic span annotation using Dataset 1 and Dataset 2 according to an embodiment of the present invention;

FIG. 8 is a table illustrating a comparison of six explanation methods on Dataset 1 and Dataset 2 and three evaluation metrics according to an embodiment of the present invention;

FIG. 9A is a diagram illustrating the influence of K on Sag using Dataset 1 according to an embodiment of the present invention;

FIG. 9B is a diagram illustrating the influence of K on Lag using Dataset 1 according to an embodiment of the present invention;

FIG. 10 is a table illustrating a comparison of Spearman Correlation with Ground truth according to an embodiment of the present invention;

FIG. 11 is a diagram illustrating the templates used for the pre-trained reading comprehension (RC) model applied for span extraction according to an embodiment of the present invention;

FIG. 12 is a diagram illustrating an exemplary apparatus for performing one or more of the methodologies presented herein according to an embodiment of the present invention;

FIG. 13 depicts a cloud computing environment according to an embodiment of the present invention; and

FIG. 14 depicts abstraction model layers according to an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Provided herein are sample-based model explanation techniques that use any appropriate granularity of spans of training data as the explanation unit. As highlighted above, a sample-based explainability method uses training examples to explain a model's decision. However, by comparison with conventional approaches which employ the entire training instance as an explanation, the present techniques advantageously provide explanations with a finer-grained unit which are easier to interpret in many applications. Also provided herein is an evaluation metric that can evaluate the explanations based on the semantic relatedness with the test example. Advantageously, this evaluation is easier and more intuitive for humans to interpret.

As will be described in detail below, machine learning model {circumflex over (θ)} is first trained using training data D, after which masking of a datapoint(s) in a span of the training data (also referred to herein as a training span x_(ij)) is performed to determine an importance of the span, namely the influence of training span x_(ij) on a test example z′. For instance, it can be evaluated whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking of the training span x_(ij) is the same as a decision of the machine learning model {circumflex over (θ)} obtained prior to the masking. Further, the influence of the training span x_(ij) on a test span x′_(kl) can also be measured, as opposed to the entire test sequence.

Generally, the machine learning model {circumflex over (θ)} can be any type of machine learning system. By way of example only, one illustrative, non-limiting type of machine learning system is a neural network. In machine learning and cognitive science, neural networks are a family of statistical learning models inspired by the biological neural networks of animals, and in particular the brain. Neural networks may be used to estimate or approximate systems and cognitive functions that depend on a large number of inputs and weights of the connections which are generally unknown.

Neural networks are often embodied as so-called “neuromorphic” systems of interconnected processor elements that act as simulated “neurons” that exchange “messages” between each other in the form of electronic signals. See, for example, FIG. 1 which provides a schematic illustration of an exemplary neural network 100. As shown in FIG. 1 , neural network 100 includes a plurality of interconnected processor elements 102, 104/106 and 108 that form an input layer, at least one hidden layer, and an output layer, respectively, of the neural network 100. By way of example only, neural network 100 can be embodied in an analog cross-point array of resistive devices such as resistive processing units (RPUs).

Similar to the so-called ‘plasticity’ of synaptic neurotransmitter connections that carry messages between biological neurons, the connections in a neural network that carry electronic messages between simulated neurons are provided with numeric weights that correspond to the strength or weakness of a given connection. The weights can be adjusted and tuned based on experience, making neural networks adaptive to inputs and capable of learning. For example, a neural network is defined by a set of input neurons (see, e.g., input layer 102 in deep neural network 100). After being weighted and transformed by a function determined by the network's designer, the activations of these input neurons are then passed to other downstream neurons, which are often referred to as ‘hidden’ neurons (see, e.g., hidden layers 104 and 106 in neural network 100). This process is repeated until an output neuron is activated (see, e.g., output layer 108 in neural network 100). The activated output neuron makes a class decision. Instead of utilizing the traditional digital model of manipulating zeros and ones, neural networks such as neural network 100 create connections between processing elements that are substantially the functional equivalent of the core system functionality that is being estimated or approximated.

Some preliminary aspects are first provided. In one embodiment, a natural language processing example is considered where a model parameterized by {circumflex over (θ)} is trained on classification dataset D={D^(train), D^(test)} by empirical risk minimization over D^(train). z=(x, y)∈D^(train) and z′=(x′, y′)∈D^(test) are used to denote a training example and a test example, respectively, where x is a text sequence, and y is a scalar. The goal of a training example-based explanation is to provide, for a given test example z′, an ordered list of training examples as explanation. Two notable methods to calculate an influence score are IF and TracIn.

IF (see Koh and Liang) assumes that the influence of z can be measured by perturbing the loss function L with a fraction of the loss on z, which yields

$\begin{matrix} {{\mathcal{I}_{{pert},{loss}}\left( {z,z^{\prime};\hat{\theta}} \right)} = {{- \nabla_{\theta}}{L\left( {z^{\prime},\hat{\theta}} \right)}H_{\hat{\theta}}^{- 1}\nabla_{\theta}{L\left( {z,\hat{\theta}} \right)}},} & (1) \end{matrix}$

wherein H is the Hessian matrix calculated on the entire training dataset. As provided above, calculating H on the entire training dataset is a potential computation bottleneck for a large dataset D and complex model with high dimensional {circumflex over (θ)}.
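As a rough illustration of Equation 1, the following minimal Python sketch evaluates the influence score on a toy two-parameter problem. The helper names (`solve`, `influence`), the explicit Hessian, and all numeric values are assumptions made purely for illustration; they are not taken from Koh and Liang, and a practical system would not form or invert H explicitly for a large model.

```python
def solve(a, b):
    """Solve a x = b for a small dense system by Gaussian elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("Hessian is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def influence(grad_test, hessian, grad_train):
    """Equation 1: -grad L(z') . H^{-1} grad L(z)."""
    h_inv_g = solve(hessian, grad_train)
    return -sum(t * h for t, h in zip(grad_test, h_inv_g))


# Hypothetical values for a two-parameter model.
hessian = [[2.0, 0.0], [0.0, 4.0]]
grad_train = [2.0, 4.0]
grad_test = [1.0, 1.0]
score = influence(grad_test, hessian, grad_train)
```

The linear solve is the step that scales poorly with the dimension of {circumflex over (θ)}, which is the bottleneck noted above.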

By contrast, TracIn (see Garima Pruthi et al., “Estimating Training Data Influence by Tracing Gradient Descent,” 34^(th) Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada (November 2020) (17 pages) (hereinafter “Pruthi”)) assumes that the influence of a training example z is the sum of its contribution to the overall loss through the entire training history, and conveniently leads to

$\begin{matrix} {{{TracIn}\left( {z,z^{\prime}} \right)} = {\sum\limits_{i}{\eta_{i}\nabla_{{\hat{\theta}}_{i}}{L\left( {{\hat{\theta}}_{i},z} \right)}\nabla_{{\hat{\theta}}_{i}}{L\left( {{\hat{\theta}}_{i},z^{\prime}} \right)}}},} & (2) \end{matrix}$

wherein i iterates through the checkpoints saved at different training steps, and η_(i) is a weight for each checkpoint. TracIn does not involve a Hessian matrix and thus is more efficient to compute.
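Equation 2 reduces to a weighted sum of gradient dot products, which the following minimal Python sketch makes concrete. The function name `tracin_score` and the example gradients and weights are hypothetical; in practice the per-checkpoint gradients would come from the saved model checkpoints.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tracin_score(grads_z, grads_z_test, etas):
    """Equation 2: sum over checkpoints i of eta_i * <grad L(theta_i, z), grad L(theta_i, z')>."""
    if not (len(grads_z) == len(grads_z_test) == len(etas)):
        raise ValueError("one gradient pair and one weight are needed per checkpoint")
    return sum(eta * dot(gz, gt)
               for eta, gz, gt in zip(etas, grads_z, grads_z_test))


# Hypothetical loss gradients at two checkpoints of a two-parameter model.
grads_z = [[1.0, 0.0], [0.5, 0.5]]        # training example z
grads_z_test = [[2.0, 1.0], [1.0, -1.0]]  # test example z'
etas = [0.1, 0.1]
score = tracin_score(grads_z, grads_z_test, etas)
```

No matrix inverse appears anywhere in the computation, which is the source of the efficiency advantage described above.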

One desirable feature for an explanation method is efficiency. For each z′, TracIn requires O(CG) time, wherein C is the number of models and G is the time spent for gradient calculation. By comparison, IF needs O(N²G), wherein N is the number of training examples, and N>>C. It is notable that some approximations, such as the Hessian-inverse-vector product, may improve efficiency to O(NSG), wherein S is the approximation step, and S<N. See, for example, Baydin et al., “Tricks from Deep Learning,” arXiv:1611.03777v1 (November 2016) (4 pages) (hereinafter “Baydin”).

Another desirable feature for an explanation method is faithfulness. IF is faithful to {circumflex over (θ)} since all its calculations are based on a single final model. However, TracIn may be less faithful to {circumflex over (θ)} since it obtains gradients from a set of checkpoints. It may be said that TracIn is faithful to the data rather than to the model and, in the case where checkpoint averaging can be used as a model prediction, the number of checkpoints may be too few to justify Equation 2 (above).

As highlighted above, a desirable feature for an explanation method is interpretability. Both IF and TracIn methods use the entire training example as an explanation. Explanations with a finer-grained unit, e.g., phrases, may be easier to interpret in many applications where the texts are lengthy.

To improve on the above desiderata, the present techniques advantageously meet the following goals. First, the explanation method provided herein is able to use any appropriate granularity of span(s) as the explanation unit, thereby improving interpretability. Second, it avoids the need for a Hessian matrix while at the same time maintaining faithfulness.

To achieve the first goal of improved interpretability with spans, influence functions such as those described in Koh and Liang are employed, and an arbitrary span of a training sequence x is evaluated for its qualification as an explanation. It is notable that this approach can be easily generalized to multiple spans. The core idea is to see how the model loss on a test example z′ changes with the training span's importance. Namely, the more important a training span is to z′, the greater this influence score should be, and vice versa.

In that regard, FIG. 2 is a diagram illustrating an exemplary methodology 200 for explaining a machine learning model {circumflex over (θ)}. In step 202, the machine learning model {circumflex over (θ)} is trained on training data D, and in step 204 a decision of the (trained) machine learning model {circumflex over (θ)} is obtained.

Masking is then used to identify datapoints in the training data D that are significant. For instance, in step 206 a datapoint(s) in the training data D is/are masked, i.e., masked datapoints are replaced with [MASK]. Next, it is determined whether or not the masking has had an impact on the decisions made by the machine learning model {circumflex over (θ)}. For instance, in step 208 a determination is made as to whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking is the same as the decision obtained in step 204, i.e., prior to the masking. In step 210, the masking is used to explain which of the datapoints in the training data D are significant. In other words, those datapoints in the training data D that, when masked, change the decision of the machine learning model {circumflex over (θ)} (i.e., the decision of the machine learning model {circumflex over (θ)} prior to the masking and after the masking are different) are significant.
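The comparison performed in steps 206 through 210 can be sketched in Python as follows. This is only a minimal illustration: `toy_decision` is a hypothetical keyword-counting stand-in for the trained machine learning model {circumflex over (θ)}, and the word lists, function names, and candidate spans are assumptions, not part of the described method.

```python
MASK = "[MASK]"
POSITIVE = {"good", "great"}   # hypothetical sentiment lexicon
NEGATIVE = {"bad", "suffers"}


def toy_decision(tokens):
    """Hypothetical stand-in for the decision of the trained model."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative"


def mask_span(tokens, i, j):
    """Replace tokens i..j (inclusive) with [MASK], as in step 206."""
    return tokens[:i] + [MASK] * (j - i + 1) + tokens[j + 1:]


def significant_spans(tokens, spans, decide):
    """Return the spans whose masking changes the decision (steps 208 and 210)."""
    original = decide(tokens)
    return [(i, j) for i, j in spans
            if decide(mask_span(tokens, i, j)) != original]


tokens = "the food was good".split()
candidate_spans = [(0, 1), (3, 3)]
changed = significant_spans(tokens, candidate_spans, toy_decision)
```

Here masking "the food" leaves the toy decision unchanged, whereas masking "good" flips it, so only the span (3, 3) is reported as significant.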

According to an exemplary embodiment, methodology 200 employs an arbitrary span of the training data (also referred to herein as a training span x_(ij), i.e., from token i to token j) for the masking, and a determination is made as to the importance of that training span x_(ij) on a training example using a loss gradient, which is then scaled to determine the influence of the training span x_(ij) on the machine learning model {circumflex over (θ)}. See, for example, exemplary methodology 300 of FIG. 3 . The analysis can then be expanded to obtain the influence of the training span x_(ij) to a test example z′. See, for example, exemplary methodology 400 of FIG. 4 .

Referring first to methodology 300 of FIG. 3 , the process begins with a training example z=(x, y) from training data D, wherein x is a training sequence and y is the model prediction (i.e., decision), and in step 302 at least one training span x_(ij) is identified in z=(x, y). To use an illustrative, non-limiting example, a training example z=(x, y) from training data D might be:

x=[CLS]food[SEP] The food has been good but the service really suffers[SEP]

y=positive.

[CLS] and [SEP] tokens are used to separate sentences. For example, in natural language processing, the [CLS] or classification token represents sentence-level classification and is typically used as a sentence start token, and the [SEP] or separation token separates sentences for the next task. As per step 302, the key span (here, “has been good”) is identified based on rules or by asking questions and obtaining the answers (Rule/QA) as:

Rule/QA: [CLS]food[SEP] The food has been good but the service really suffers[SEP].

As will be described in detail below, absent information about valid spans, shallow parsing tools and/or sentence split tools can be used to divide text sequences into chunks of text. Alternatively, deterministic, i.e., non-random, spans may be employed.

In step 304, the training span x_(ij) is masked, where x_(−ij) is the sequence with x_(ij) masked, i.e., z_(−ij)=(x_(−ij), y), wherein z_(−ij) is the training data corresponding to the training span x_(ij). Using the above illustrative, non-limiting example, as per step 304 masking the training span x_(ij) provides:

x=[CLS]food[SEP] The food[MASK][MASK][MASK] but the service really suffers[SEP]

y=positive.

In step 306, the masking z_(−ij)=(x_(−ij), y) is then used to determine an importance of the training span x_(ij) on training example z=(x, y) (prior to the masking). As provided above, a determination is made as to whether the masking affects a new decision of the machine learning model {circumflex over (θ)}. In the illustrative, non-limiting example, the decision of the machine learning model {circumflex over (θ)}, see the description of steps 302 (decision) and 304 (new decision) above, does not change, i.e., both decisions are y=positive.
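Steps 302 and 304 can be reproduced on the example sequence with a few lines of Python. In this sketch, a simple phrase lookup (`find_span`) stands in for the rule-based or question-answering span identification; that substitution, the whitespace tokenization, and the function names are assumptions made for illustration only.

```python
MASK = "[MASK]"


def find_span(tokens, phrase):
    """Return (i, j), the inclusive token indices of the first match of phrase."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n - 1
    raise ValueError("phrase not found in sequence")


def mask_span(tokens, i, j):
    """Build x_(-ij): tokens i..j (inclusive) are replaced with [MASK]."""
    return tokens[:i] + [MASK] * (j - i + 1) + tokens[j + 1:]


x = ("[CLS] food [SEP] The food has been good "
     "but the service really suffers [SEP]").split()
i, j = find_span(x, ["has", "been", "good"])   # step 302 (hypothetical lookup)
x_masked = mask_span(x, i, j)                  # step 304
```

The label y is left untouched, so the masked training data pairs `x_masked` with the same y as the original example.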

According to an exemplary embodiment, the importance of the training span x_(ij) on training example z=(x, y), i.e., imp(x_(ij)|z,{circumflex over (θ)}), is evaluated using a loss gradient on the trained machine learning model {circumflex over (θ)} (e.g., a trained neural network). See FIG. 3 . For instance, in one exemplary embodiment, the loss difference between z_(−ij) and z is evaluated to determine how important the training span x_(ij) is. See Equation 3, below. As will be described in detail below, imp(x_(ij)|z,{circumflex over (θ)}) can then be scaled to determine the influence of the training span x_(ij) on the machine learning model {circumflex over (θ)}. See step 308.

As highlighted above, the analysis can then be expanded to obtain the influence of the training span x_(ij) to a test example z′. See, for example, methodology 400 of FIG. 4 . In step 402, the importance of a test span x′_(kl) on test example z′ (i.e., ∇imp(x′_(kl)|z′;{circumflex over (θ)})) is determined, wherein x′_(kl) is a test span from test example z′. According to an exemplary embodiment, ∇imp(x′_(kl)|z′,{circumflex over (θ)}) is determined as:

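$\nabla{imp}\left( {x_{kl}^{\prime}❘z^{\prime};\hat{\theta}} \right) = \left\lbrack {{\nabla\mathcal{L}\left( {{\hat{\theta} + \delta_{i}},z_{- {kl}}^{\prime}} \right)} - {\nabla\mathcal{L}\left( {{\hat{\theta} + \delta_{i}},z^{\prime}} \right)}} \right\rbrack,$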

wherein δ_(i)=η_(i)g(z_(i)|{circumflex over (θ)}) and is obtained from a single-step gradient descent g(z_(i)|{circumflex over (θ)}) that is estimated with some training instance z_(i) on model {circumflex over (θ)}, scaled by an i-specific weighting parameter η_(i). See below.

In step 404, the importance of the training span x_(ij) on the training example z (i.e., ∇imp(x_(ij)|z;{circumflex over (θ)})) is determined, wherein x_(ij) is the training span from the training example z. According to an exemplary embodiment, ∇imp(x_(ij)|z;{circumflex over (θ)}) is determined as:

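$\nabla{imp}\left( {x_{ij}❘z;\hat{\theta}} \right) = \left\lbrack {{\nabla\mathcal{L}\left( {{\hat{\theta} + \delta_{i}},z} \right)} - {\nabla\mathcal{L}\left( {{\hat{\theta} + \delta_{i}},z_{- {ij}}} \right)}} \right\rbrack.$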

See below.

In step 406, the influence of the training example z on the test example z′ is determined using ∇imp(x′_(kl)|z′;{circumflex over (θ)}) and ∇imp(x_(ij)|z;{circumflex over (θ)}). For instance, according to an exemplary embodiment, the influence of the training example z on the test example z′ is computed as:

∇imp(x′_(kl)|z′;{circumflex over (θ)})∇imp(x_(ij)|z;{circumflex over (θ)}).

As provided above, the training span defined from token i to token j is x_(ij), and the sequence with x_(ij) masked is:

x_(−ij)=[x_(0), . . . ,x_(i−1),[MASK], . . . ,[MASK],x_(j+1), . . . ],

and its corresponding training data is z_(−ij). In this approach, an importance score is based on the logit difference. See, for example, Li et al., “BERT-ATTACK: Adversarial Attack Against BERT Using BERT,” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 6193-6202 (November 2020). Namely, the logit difference is used as an importance score based on the empirical-risk-estimated parameter {circumflex over (θ)} obtained from D^(train) as: imp(x_(ij)|z,{circumflex over (θ)})=logit_(y)(x;{circumflex over (θ)})−logit_(y)(x_(−ij);{circumflex over (θ)}), wherein every term on the right-hand side (RHS) is the logit output evaluated at a model prediction y from model {circumflex over (θ)} to be explained, right before applying a normalized exponential function. This equation indicates how important a training span is, and is equivalent to the loss difference:

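$\begin{matrix} {{{imp}\left( {x_{ij}❘z,\hat{\theta}} \right)} = {{\mathcal{L}\left( {z_{- {ij}};\hat{\theta}} \right)} - {\mathcal{L}\left( {z;\hat{\theta}} \right)}}} & (3) \end{matrix}$

when the cross-entropy loss $\mathcal{L}\left( {z;\hat{\theta}} \right) = - \sum_{y_{i}}{\mathbb{I}\left( {y = y_{i}} \right){logit}_{y_{i}}\left( {x;\hat{\theta}} \right)}$ is applied.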

Next, the influence of x_(ij) on model {circumflex over (θ)} is measured by adding a fraction of imp(x_(ij)|z;{circumflex over (θ)}) scaled by a small value ϵ to the overall loss, and

${\hat{\theta}}_{\epsilon,{x_{ij}❘z}}:={{\arg\min_{\theta}{E_{z_{i} \in D^{train}}\left\lbrack {\mathcal{L}\left( {z_{i},\theta} \right)} \right\rbrack}} + {\epsilon\mathcal{L}\left( {z_{- {ij}};\theta} \right)} - {\epsilon\mathcal{L}\left( {z;\theta} \right)}}$

is obtained. The influence of up-weighting the importance of x_(ij) on {circumflex over (θ)} is obtained as:

${\frac{d{\hat{\theta}}_{\epsilon,{x_{ij}❘z}}}{d\epsilon}❘_{\epsilon = 0}} = {H_{\hat{\theta}}^{- 1}\left( {{\nabla_{\hat{\theta}}{L\left( {z;\hat{\theta}} \right)}} - {\nabla_{\hat{\theta}}{L\left( {z_{- {ij}};\hat{\theta}} \right)}}} \right)}$

by applying the classical result in Koh and Liang, and Cook and Weisberg, “Residuals and Influence in Regression,” Chapter 3, Assessment of Influence, pp. 101-156, New York, Chapman and Hall 1982 (see, e.g., Eq.3.2.1).

Finally, applying the equation immediately above and the chain rule, the influence of x_(ij) to z′ is obtained as:

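${{IF}^{+}\left( {{x_{ij}❘z},z^{\prime};\hat{\theta}} \right)}:={{\nabla_{\epsilon}{L\left( {z^{\prime};{\hat{\theta}}_{\epsilon,{x_{ij}❘z}}} \right)}❘_{\epsilon = 0}} = {\nabla_{\theta}{L\left( {z^{\prime};\hat{\theta}} \right)}{H_{\hat{\theta}}^{- 1}\left( {{\nabla_{\theta}{L\left( {z;\hat{\theta}} \right)}} - {\nabla_{\theta}{L\left( {z_{- {ij}};\hat{\theta}} \right)}}} \right)}.}}$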

IF⁺ measures the influence of a training span on an entire test sequence. Similarly, the influence of a training span to a test span x′_(kl) is also measured by applying Equation 3 above to obtain:

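${{IF}^{++}\left( {{x_{ij}❘z},{x_{kl}^{\prime}❘z^{\prime}};\hat{\theta}} \right)}:={{\left\lbrack {{\nabla_{\epsilon}{L\left( {z_{- {kl}}^{\prime};{\hat{\theta}}_{\epsilon,{x_{ij}❘z}}} \right)}} - {\nabla_{\epsilon}{L\left( {z^{\prime};{\hat{\theta}}_{\epsilon,{x_{ij}❘z}}} \right)}}} \right\rbrack ❘_{\epsilon = 0}} = {\left( {{\nabla_{\theta}{L\left( {z_{- {kl}}^{\prime};\hat{\theta}} \right)}} - {\nabla_{\theta}{L\left( {z^{\prime};\hat{\theta}} \right)}}} \right){H_{\hat{\theta}}^{- 1}\left( {{\nabla_{\theta}{L\left( {z;\hat{\theta}} \right)}} - {\nabla_{\theta}{L\left( {z_{- {ij}};\hat{\theta}} \right)}}} \right)}.}}$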

The complete derivation can be found below. IF⁺ and IF⁺⁺ are similar to IF but operate with spans.
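To make the structure of IF⁺ and IF⁺⁺ concrete, the following Python sketch evaluates both scores from precomputed loss gradients. The caller supplies a Hessian-inverse-vector product as a function (`hinv_vec`); in the demo this is a hypothetical diagonal Hessian chosen only so that the product is trivial to compute. All names and numbers are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def if_plus(grad_test, grad_z, grad_z_masked, hinv_vec):
    """IF+: grad L(z') . H^{-1} (grad L(z) - grad L(z_-ij))."""
    return dot(grad_test, hinv_vec(sub(grad_z, grad_z_masked)))


def if_plus_plus(grad_test, grad_test_masked, grad_z, grad_z_masked, hinv_vec):
    """IF++: (grad L(z'_-kl) - grad L(z')) . H^{-1} (grad L(z) - grad L(z_-ij))."""
    test_term = sub(grad_test_masked, grad_test)
    return dot(test_term, hinv_vec(sub(grad_z, grad_z_masked)))


# Hypothetical diagonal Hessian, so H^{-1} v is an element-wise division.
hessian_diag = [2.0, 4.0]


def hinv_vec(v):
    return [a / d for a, d in zip(v, hessian_diag)]


grad_z, grad_z_masked = [3.0, 1.0], [1.0, 1.0]        # training example and masked copy
grad_test, grad_test_masked = [0.5, 2.0], [1.5, 2.0]  # test example and masked copy

score_plus = if_plus(grad_test, grad_z, grad_z_masked, hinv_vec)
score_plus_plus = if_plus_plus(grad_test, grad_test_masked,
                               grad_z, grad_z_masked, hinv_vec)
```

Passing the Hessian-inverse-vector product as a function keeps the scoring code unchanged if an approximate product is substituted for an exact one.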

Regarding the choice of spans, theoretically, IF⁺ and IF⁺⁺ can be applied to any text classification problem and dataset with an appropriate choice of the span. If no information about valid spans is available, shallow parsing tools or sentence-split tools can be used to shatter an entire text sequence into chunks. Under ideal circumstances, one may even be able to find deterministic spans instead of enumerating candidates. Two aspect-based sentiment analysis datasets, in which the deterministic spans can be conveniently identified, are adopted herein, and the span selection task is framed as a Reading Comprehension task. See Pranav Rajpurkar et al., “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Tex., November 1-5 (2016) (hereinafter “Rajpurkar”). The details are discussed below.

To achieve the second goal of faithful and Hessian matrix-free explanations, for faithfulness purposes a set of models will be computed that are close to the model {circumflex over (θ)} to be explained (i.e., faithful model variants), and then a Hessian matrix-free explanation process is used to guarantee efficiency. According to an exemplary embodiment, the method of TracIn (see Pruthi) described in Equation 2 above, which is Hessian free by design, is employed as a starting point. TracIn defines the contribution of a training instance to be the sum of its contribution (loss) throughout the entire training life cycle, which eliminates the need for a Hessian matrix. However, this assumption is drastically different from that of IF, where the contribution of z is obtained solely from the final model {circumflex over (θ)}. By nature, IF is a faithful method, and its explanation is faithful to {circumflex over (θ)}, whereas TracIn in its basic form is arguably not a faithful method.

The present techniques are based on the assumption that the influence of z on {circumflex over (θ)} is the sum of the influences of all variants close to {circumflex over (θ)}, defining a set of “faithful” model variants satisfying the constraint of {{circumflex over (θ)}_(i)|1>δ>>∥{circumflex over (θ)}_(i)−{circumflex over (θ)}∥₂}, namely δ-faithful to {circumflex over (θ)}. The smaller δ is, the more faithful the explanation method is. By contrast, the δ for TracIn can be arbitrarily large without faithfulness guarantees, as some checkpoints can be far from the final {circumflex over (θ)}. Thus, a δ-faithful explanation method is constructed herein that mirrors TracIn as:

${{TracInF}\left( {z,z^{\prime}} \right)} = {\sum\limits_{i}{\nabla_{\hat{\theta} + \delta_{i}}{L\left( {{\hat{\theta} + \delta_{i}},z} \right)}\nabla_{\hat{\theta} + \delta_{i}}{{L\left( {{\hat{\theta} + \delta_{i}},z^{\prime}} \right)}.}}}$

As was described in conjunction with the description of methodology 300 of FIG. 3 above, this importance score TracInF(z,z′) can be used to rank the training examples that are used to explain the test example z′.

The difference between TracIn and TracInF is that the checkpoints used in TracIn are correlated in time whereas all variants of TracInF are conditionally independent. Finding a proper δ_(i) can be challenging. If ill-chosen, {circumflex over (θ)}_(i) may diverge from {circumflex over (θ)} so much that it negatively impacts gradient estimation. In one embodiment, δ_(i)=η_(i)g(z_(i)|{circumflex over (θ)}) is obtained from a single-step gradient descent g(z_(i)|{circumflex over (θ)}) that is estimated with some training instance z_(i) on model {circumflex over (θ)}, scaled by an i-specific weighting parameter η_(i), which in the simplest case is uniform for all i. Usually, η_(i) should be small enough so that {circumflex over (θ)}+δ_(i) can stay close to {circumflex over (θ)}. Here, η is set as the model learning rate for proof of concept.
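The construction of the faithful variants and the resulting TracInF score can be sketched in Python on a toy logistic model. This is an illustrative sketch under stated assumptions: the two-parameter model, the data, and the function names are hypothetical, and the single-step gradient descent is taken to mean δ_(i) = −η∇L({circumflex over (θ)}, z_(i)), with a uniform η.

```python
import math


def grad(theta, x, y):
    """Log-loss gradient of a logistic model: (sigmoid(theta . x) - y) * x."""
    p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    return [(p - y) * xi for xi in x]


def faithful_variants(theta, anchors, eta):
    """One variant theta + delta_i per anchor z_i (descent-step sign is an assumption)."""
    out = []
    for x, y in anchors:
        g = grad(theta, x, y)
        out.append([t - eta * gi for t, gi in zip(theta, g)])
    return out


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def tracinf(theta, anchors, eta, z, z_test):
    """Sum over variants of grad L(theta + delta_i, z) . grad L(theta + delta_i, z')."""
    total = 0.0
    for v in faithful_variants(theta, anchors, eta):
        gz, gt = grad(v, *z), grad(v, *z_test)
        total += sum(a * b for a, b in zip(gz, gt))
    return total


theta = [0.5, -0.25]                                   # hypothetical final model
anchors = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
eta = 0.01
z, z_test = ([1.0, 0.0], 1), ([1.0, 0.5], 1)

max_drift = max(distance(v, theta) for v in faithful_variants(theta, anchors, eta))
score = tracinf(theta, anchors, eta, z, z_test)
```

With a small η every variant stays within a short distance of the final parameters, which is the property the δ-faithfulness constraint is meant to capture.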

1>>δ>∥{circumflex over (θ)} . . . ∥

As to whether TracInF is faithful, it is first noted that any {circumflex over (θ)}+δ_(i) is close to {circumflex over (θ)}. Under the assumption of Lipschitz continuity, there exists a k∈ℝ⁺ such that ∇L({circumflex over (θ)}+δ_(i),z) is bounded around ∇L({circumflex over (θ)},z) by k|η_(i)g²(z_(i)|{circumflex over (θ)})|, the second derivative, because |∇L({circumflex over (θ)}+δ_(i),z)−∇L({circumflex over (θ)},z)|<k|η_(i)g²(z_(i)|{circumflex over (θ)})|. A proper weighting parameter η_(i) can be chosen so that the right-hand side (RHS) is sufficiently small to bound the loss within a small range. Thus, the gradient of loss, and in turn the TracInF score can stay δ-faithful to {circumflex over (θ)} for a sufficiently small δ, which TracIn cannot guarantee.

A combination of the above techniques is also provided herein. Namely, by combining the insights above related to improved interpretability with spans and faithful, Hessian matrix-free explanations, a final form referred to herein as TracIn⁺⁺ can be obtained:

$\mathrm{TracIn}^{++}\left(x'_{kl} \mid z',\; x_{ij} \mid z;\; \hat{\theta}\right) = \sum\limits_{i} \left[\nabla\mathcal{L}\left(\hat{\theta} + \delta_{i}, z'_{-kl}\right) - \nabla\mathcal{L}\left(\hat{\theta} + \delta_{i}, z'\right)\right] \left[\nabla\mathcal{L}\left(\hat{\theta} + \delta_{i}, z\right) - \nabla\mathcal{L}\left(\hat{\theta} + \delta_{i}, z_{-ij}\right)\right].$

This ultimate form mirrors the IF⁺⁺ method, and it satisfies all of the above desiderata on an improved explainability technique. Similarly, TracIn⁺ that mirrors IF⁺ is

${{TracIn}^{+}\left( {z^{\prime},{{x_{ij}❘z};\hat{\theta}}} \right)} = {\sum\limits_{i}{{{{\bigtriangledown\mathcal{L}}\left( {z^{\prime};{\hat{\theta} + \delta_{i}}} \right)}\left\lbrack {{{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta} + \delta_{i}},z} \right)} - {{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta} + \delta_{i}},z_{- {ij}}} \right)}} \right\rbrack}.}}$
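By way of a non-limiting illustration only, the span-based scores TracIn⁺ and TracIn⁺⁺ can be sketched with a toy bag-of-words logistic-regression model, in which masking a span simply removes the features of the masked tokens. The vocabulary, the `[MASK]` handling, the parameter values and all function names are assumptions made for this sketch and are not taken from the present description.

```python
import math

VOCAB = {"food": 0, "great": 1, "service": 2, "slow": 3}  # toy vocabulary


def featurize(tokens):
    x = [0.0] * len(VOCAB)
    for tok in tokens:
        if tok in VOCAB:  # "[MASK]" and unknown words carry no feature
            x[VOCAB[tok]] += 1.0
    return x


def mask_span(tokens, i, j):
    """Copy of tokens with positions i..j (inclusive) replaced by [MASK]."""
    return tokens[:i] + ["[MASK]"] * (j - i + 1) + tokens[j + 1:]


def grad_loss(theta, tokens, y):
    x = featurize(tokens)
    p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    return [(p - y) * xi for xi in x]


def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tracin_plus(variants, test, train, span_ij):
    """Whole test example against a training span x_ij."""
    (xt, yt), (x, y), (i, j) = test, train, span_ij
    total = 0.0
    for th in variants:
        train_diff = sub(grad_loss(th, x, y), grad_loss(th, mask_span(x, i, j), y))
        total += dot(grad_loss(th, xt, yt), train_diff)
    return total


def tracin_pp(variants, test, span_kl, train, span_ij):
    """Test span x'_kl against training span x_ij: four gradients per variant."""
    (xt, yt), (x, y) = test, train
    (k, l), (i, j) = span_kl, span_ij
    total = 0.0
    for th in variants:
        test_diff = sub(grad_loss(th, mask_span(xt, k, l), yt), grad_loss(th, xt, yt))
        train_diff = sub(grad_loss(th, x, y), grad_loss(th, mask_span(x, i, j), y))
        total += dot(test_diff, train_diff)
    return total


variants = [[0.1, 0.2, -0.1, -0.3]]  # stands in for the models theta_hat + delta_i
test = (["food", "great"], 1)
train = (["the", "food", "great"], 1)
informative = tracin_pp(variants, test, (1, 1), train, (1, 2))    # span "food great"
uninformative = tracin_pp(variants, test, (1, 1), train, (0, 0))  # span "the"
```

In this toy setting, masking a training span that carries no features leaves the gradient unchanged, so its score is exactly zero, whereas a span that shares vocabulary with the test span receives a non-zero score.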

Additional details are now provided. Since the RHS of the IF, IF⁺ and IF⁺⁺ equations all involve the inverse of a Hessian matrix, the computation challenge is now addressed. Following Koh and Liang, the vector-Hessian-inverse-product (VHP) is adopted herein with stochastic estimation (see, e.g., Baydin). The series of stochastic updates, one for each training instance, is performed by the vhp( ) function, and the updates continue until convergence. See, e.g., PyTorch Tutorials, “A Gentle Introduction to torch.autograd,” accessed from https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html on May 25, 2021 (4 pages). Unfortunately, it was found that naively applying this approach leads to the rapid growth of VHP complexity due to the large parameter size. To be specific, in the present case, the parameters are the last two layers of the pretraining approach described in Liu et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692v1 (July 2019) (13 pages) (hereinafter “Liu”), plus the output head, a total of 12M parameters per gradient vector. To stabilize the process, three approaches are taken: 1) applying gradient clipping (set to 100) to avoid accumulating the extreme gradient values; 2) adopting early termination when the norm of VHP stabilizes (usually <1000 training instances, i.e., the depth); and 3) slowly decaying the accumulated VHP with a factor of 0.99 (i.e., the damp) and updating it with a new vhp( ) estimate with a small learning rate (i.e., the scale) of 0.004. Once obtained, the VHP is first cached and then retrieved to perform the dot-product with the last term. The complexity for each test instance is O(dt), where d is the depth of estimation and t is the time spent on each vhp( ) operation. The time complexities of the different IF methods differ only by a constant factor of two.
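By way of a non-limiting illustration only, one possible way of combining clipping, early termination, decay (damp) and a small update rate (scale) into an iterative estimate is sketched below. The exact recursion is an assumption of this sketch rather than a statement of the implementation used herein: with the update h ← v + damp·h − scale·Hh, the returned value converges to the damped product (H + λI)⁻¹v with λ = (1 − damp)/scale. The callable `hvp_fn`, the tolerance `tol` and the fixed 2×2 matrix standing in for per-instance Hessian-vector products are all hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def estimate_inverse_hvp(v, hvp_fn, depth=1000, damp=0.99, scale=0.004,
                         clip=100.0, tol=1e-6):
    """Iteratively estimate (H + lam*I)^-1 v, with lam = (1 - damp) / scale."""
    h = list(v)
    prev = norm(h)
    for step in range(depth):
        # 1) Clip the Hessian-vector product so extreme values do not accumulate.
        hv = [max(-clip, min(clip, c)) for c in hvp_fn(h, step)]
        # 3) Decay the accumulated estimate and apply the scaled update.
        h = [vi + damp * hi - scale * hvi for vi, hi, hvi in zip(v, h, hv)]
        # 2) Terminate early once the norm of the estimate stabilizes.
        cur = norm(h)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return [scale * hi for hi in h]


H = [[2.0, 0.5], [0.5, 1.0]]  # toy Hessian; a real hvp_fn would use one instance per step


def toy_hvp(h, step):
    return [H[0][0] * h[0] + H[0][1] * h[1], H[1][0] * h[0] + H[1][1] * h[1]]


v = [0.1, -0.1]
estimate = estimate_inverse_hvp(v, toy_hvp)
```

Each iteration costs one Hessian-vector product, which is consistent with a per-test-instance cost proportional to the depth multiplied by the time per vhp( ) call.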

For each of TracIn, TracIn⁺ and TracIn⁺⁺, multiple model variants need to be created. For TracIn, three checkpoints of the most recent training epochs are saved. For TracIn⁺ and TracIn⁺⁺, one-step training (learning rate 1E-4) is repeatedly performed using a randomly selected mini-batch (size 100) three times to obtain three variants. These hyper-parameters were deliberately not over-tuned, for the sake of replicability.

As highlighted above, also provided herein are techniques for evaluating the quality of the explanations based on semantic relatedness. See, for example, exemplary methodology 500 of FIG. 5 and description of Sag, below. Advantageously, such evaluations are easier and more intuitive for human users to interpret.

Referring to methodology 500 of FIG. 5 , in step 502 a semantic representation of training span x_(ij) of the training example z is defined. According to an exemplary embodiment, the semantic representation of training span x_(ij) of the training example z is defined as per Equation 5, described below. In step 504, the similarity of the semantic representation of training span x_(ij) of the training example z to a semantic representation of test span x′_(kl) of the test example z′ is measured.

The present semantic evaluation metric (semantic agreement or Sag) is now described, followed by a description of two other popular metrics for comparison. Intuitively, a rational explanation method should rank explanations that are semantically related to the given test instance relatively higher than the less relevant ones. The idea is to first define the semantic representation of a training span x_(ij) of z, and then measure its similarity to that of a test span x′_(kl) of z′. Since the present techniques use the BERT family as the base model, the embedding of a training span is obtained as the difference between the embedding of x and that of its span-masked version x_(−ij):

emb(x _(ij))=emb(x)−emb(x _(−ij)),  (4)

wherein emb is obtained from the embedding of sentence start token such as “[CLS]” in BERT (Devlin) at the last embedding layer. To obtain an embedding of the entire sequence one can simply use the emb(x) without the last term in Equation 4. Thus, all spans are embedded in the same semantic space and the geometric quantities such as cosine or dot-product can measure the similarities of embeddings. The semantic agreement Sag is defined as:

$\begin{matrix} {{{{Sag}\left( {z^{\prime},{\left\{ z_{k} \right\} ❘_{1}^{K}}} \right)} = {\frac{1}{K}{\sum\limits_{z_{k}}{\cos\left( {{{emb}\left( {x_{ij}❘z_{k}} \right)},{{emb}\left( {x_{kl}^{\prime}❘z^{\prime}} \right)}} \right)}}}},} & (5) \end{matrix}$

calculated for a test example z′ over its list of explanations (z, TracIn⁺⁺(z|z′)). Equation 4 above can be used to get emb( ), the embedding of x_(ij). Intuitively, the metric measures the degree to which the top-K training spans align semantically with a test span. As provided above, Equation 5 can be used to determine the semantic representation of training span x_(ij) of z as per step 502 in methodology 500 of FIG. 5 .
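By way of a non-limiting illustration only, Equations 4 and 5 can be sketched as follows. The short vectors stand in for sentence-start-token embeddings that would be produced by the model, and the function names are hypothetical.

```python
import math


def span_embedding(emb_full, emb_masked):
    """Equation 4: emb(x_ij) = emb(x) - emb(x_-ij)."""
    return [a - b for a, b in zip(emb_full, emb_masked)]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def sag(test_span_emb, ranked_train_span_embs, k):
    """Equation 5: mean cosine between the test span and the top-K training spans."""
    top = ranked_train_span_embs[:k]
    if not top:
        raise ValueError("need at least one explanation")
    return sum(cosine(e, test_span_emb) for e in top) / len(top)


# Toy embeddings: full-text embedding minus span-masked embedding.
test_span = span_embedding([1.0, 2.0], [0.0, 2.0])   # -> [1.0, 0.0]
train_spans = [
    span_embedding([3.0, 1.0], [1.0, 1.0]),          # -> [2.0, 0.0], aligned
    span_embedding([1.0, 4.0], [1.0, 1.0]),          # -> [0.0, 3.0], orthogonal
]
sag_at_1 = sag(test_span, train_spans, 1)
sag_at_2 = sag(test_span, train_spans, 2)
```

Here the list of training span embeddings is assumed to be already ordered by explanation score, so that taking the first K entries yields the top-K explanations.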

Regarding other metrics, with one approach Label Agreement (Lag) (see Kazuaki Hanawa et al., “Evaluation of Similarity-Based Explanations,” arXiv:2006.04528v2 (March 2021) (26 pages)) assumes that the label of an explanation z should agree with that of the test case z′. Accordingly, the top-K training instances are retrieved from the ordered explanation list and the label agreement (Lag) is calculated as:

$\mathrm{Lag}\left(z', \{z\}|_{1}^{N}\right) = \frac{1}{K}\sum\limits_{k \in [1,K]} \mathcal{I}\left(y' == y_{k}\right),$

wherein ℐ(⋅) is an indicator function. Lag measures the degree to which the top-ranked z agree with z′.
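By way of a non-limiting illustration only, the label agreement can be computed as sketched below, where the labels and the function name `lag` are hypothetical and the list of training labels is assumed to be ordered by explanation score.

```python
def lag(test_label, ranked_train_labels, k):
    """Fraction of the top-K explanations whose label equals the test label."""
    top = ranked_train_labels[:k]
    if not top:
        raise ValueError("need at least one explanation")
    return sum(1 for y in top if y == test_label) / len(top)


ranked_labels = ["pos", "neg", "pos", "pos"]  # toy labels, best explanation first
lag_at_2 = lag("pos", ranked_labels, 2)
lag_at_4 = lag("pos", ranked_labels, 4)
```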

With another approach, Re-training Accuracy Loss (Ral) measures the loss of test accuracy after removing the top-K most influential explanations identified by an explanation method. See, for example, Sara Hooker et al., “A Benchmark for Interpretability Methods in Deep Neural Networks,” 33^(rd) Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver Canada (November 2019) (17 pages) (hereinafter “S. Hooker”) and Xiaochuang Han et al., “Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions,” arXiv:2005.06676v1 (May 2020) (11 pages). The higher the loss the better the explanation method is. Formally,

Ral(f,{circumflex over (θ)})=Acc({circumflex over (θ)})−Acc({circumflex over (θ)}′),

wherein {circumflex over (θ)}′ is the model re-trained on the set D^(train)/{z}|₁ ^(K), i.e., the training data with the top-K explanations removed. It is notable that the re-training uses the same set of hyper-parameter settings as training (see description of model training details, below). To obtain {z}|₁ ^(K), the explanation lists are combined for all test instances (by score addition) and the top-K are then removed from this list.
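By way of a non-limiting illustration only, the selection of the top-K training instances by score addition, and the resulting accuracy difference, can be sketched as follows. The re-training itself is not shown; the training identifiers, the scores and the two accuracy values are hypothetical placeholders.

```python
from collections import defaultdict


def top_k_to_remove(explanation_lists, k):
    """Combine per-test-instance scores by addition and return the top-K ids."""
    totals = defaultdict(float)
    for scores in explanation_lists:
        for train_id, score in scores.items():
            totals[train_id] += score
    # Sort by descending total score; ties are broken by id for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]


def ral(acc_before, acc_after):
    """Loss of test accuracy after re-training without the removed instances."""
    return acc_before - acc_after


# One dict of {training id: influence score} per test instance (toy values).
lists = [
    {"a": 0.9, "b": 0.1, "c": 0.3},
    {"a": 0.2, "b": 0.8, "c": 0.1},
]
removed = top_k_to_remove(lists, 2)
loss = ral(0.90, 0.85)  # placeholder accuracies before and after re-training
```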

The criteria for dataset selection are two-fold. First, the dataset should have relatively high classification accuracy so that the trained model can behave rationally. Second, the dataset should allow for easy identification of critical/useful text spans to compare span-based explanation methods. Two aspect-based sentiment analysis (ABSA) datasets were chosen. The first (Dataset 1) is a dataset of product reviews where aspects are the terms in the text and each sentence consists of at least two aspects with different sentiment polarities. See Qingnan Jiang et al., “A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis,” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9^(th) International Joint Conference on Natural Language Processing, pages 6280-6285, Hong Kong, China (Nov. 3-7, 2019). The second (Dataset 2) is a dataset extracted from a question answering platform with location reviews by users where units of text often mention several aspects. See Marzieh Saeidi et al., “SentiHood: Targeted Aspect Based Sentiment Analysis Dataset for Urban Neighborhoods,” arXiv:1610.03771v1 (October 2016) (11 pages). The relevant span of an aspect term can be identified semi-automatically and the models can be trained with high classification accuracy on both datasets. See below for details. Data statistics are shown in FIG. 6 and data instances are shown in FIG. 7 . Namely, FIG. 6 is a table 600 illustrating data statistics for the training, development and test sets of Dataset 1 and Dataset 2. It is notable that each training instance was treated as aspect-specific, i.e., the concatenation of the aspect term and the text x as model input.

FIG. 7 is a table 700 illustrating semi-automatic span annotation using Dataset 1 and Dataset 2. Namely, in the text, each aspect has a supporting span which is semi-automatically annotated. As shown in FIG. 7 , the spans (shown in different bold shadings) are extracted for each term to serve as explanation units for IF⁺, IF⁺⁺, TracIn⁺ and TracIn⁺⁺. To reduce annotation effort, span extraction is converted into a question answering task (see Rajpurkar) where aspect terms are used to formulate questions such as “How is the service?”, each of which is concatenated with the text before being fed into pre-trained machine reading comprehension (RC) models. The output answer is used as the span. When the RC model fails, heuristics are used to extract words before and after the term word, up to the closest sentence boundary. See below for more details. A subset of 100 annotations was sampled, and it was found that the RC model achieves about 70% Exact Match (EM) (see Rajpurkar), and that the overall annotation has a high recall of over 90% but a low EM due to the involvement of heuristics.

Wrongly-annotated spans may confuse the explanation methods. For example, as shown in FIG. 7 , if the span of location 2 is annotated as “I loved it,” span-based explanation methods will use it to find wrong examples for the explanation. Thus, in order to mitigate the annotation error, test instances with incorrectly annotated spans are omitted, i.e., no tolerance to annotation error for test instances. However, for training instances, the annotation error is not corrected. The reason is that the explanation methods have a chance to rank the wrongly annotated spans lower, i.e., its importance score imp( ) in Equation 3 can be lower, and in turn its influence scores.

The present techniques are further described by way of reference to the following non-limiting examples. With regard to model training, two separate models were trained for Dataset 1 and Dataset 2. The input to each model was the concatenation of the aspect term and the entire text, and the output was a sentiment label. The two models share similar settings. They both use the robustly optimized pretraining approach of Liu from Thomas Wolf et al., “Huggingface Transformers: State-of-the-Art Natural Language Processing,” arXiv:1910.03771v5 (July 2020) (8 pages) (hereinafter “Wolf”), which is fed into the BERTForSequenceClassification function for initialization. The parameters of the last two layers and the output head are fine-tuned using a batch size of 200 for Dataset 1 and 100 for Dataset 2, and a maximum of 100 epochs. The optimizer described in Ilya Loshchilov et al., “Decoupled Weight Decay Regularization,” ICLR 2019 (January 2019) (19 pages) was used with a weight decay of 0.01 and a learning rate of 1E-4. The models were selected based on development (dev) set performance, and both trained models are state-of-the-art: 88.3% on Dataset 1 and 97.6% on Dataset 2.

The explanation methods were then compared. See table 800 in FIG. 8 which illustrates a comparison of the six explanation methods on the two datasets (Dataset 1 and Dataset 2) and three evaluation metrics. From the results in FIG. 8 , the following conclusions can be drawn:

1) The TracIn methods outperform the IF methods according to Sag and Lag metrics. It is seen that both metrics are robust against the choice of K. It is notable that the TracIn methods are not only efficient, but are also effective for extracting explanations compared to IF as per Sag and Lag. 2) Span-based methods (with +) outperform basic methods (w/o +). This is favorable since an explanation can be much easier to comprehend if essential spans in text can be highlighted, and IF⁺⁺ and TracIn⁺⁺ show that such highlighting can be justified by their superiority on the evaluation of Sag and Lag. 3) Sag and Lag show a consistent trend of IF⁺⁺ and TracIn⁺⁺ being superior to the rest of the methods, while Ral results are inconclusive, which resonates with the findings in S. Hooker where they also observed randomness after removing examples under different explanation methods. This suggests that the re-training method may not be a reliable metric due to the randomness and intricate details involved in the re-training process. 4) Sag measures TracIn⁺ differently than Lag which shows that Lag may be an over-simplistic measure by assuming that label y can represent the entire semantics of x, which may be problematic. However, Sag looks into the x for semantics and can properly reflect and align with human judgments.

One important parameter for evaluation metrics is the choice of K for Sag and Lag (K is not discussed for Ral due to its randomness). Here, 200 test instances from Dataset 1 were used as subjects to study the influence of K on Sag and Lag. See plots 900A and 900B in FIGS. 9A and 9B, respectively.

It was found that as K increases, all methods, except for IF and TracInF, decrease on Sag and Lag. Such a decrease is favorable since it indicates that the explanation method is ranking useful training instances ahead of less useful ones. Conversely, an increase suggests that the explanation method fails to rank the useful ones on top. This again confirms that a span-based explanation can take into account the useful information in x and reduce the impact of noisy information involved in IF and TracInF.

The question of how faithful the proposed TracIn⁺⁺ is to {circumflex over (θ)} is next considered. To answer this question, the notion of a strictly faithful explanation is first defined, and the faithfulness of an explanation method is then tested against it. Note that none of the discussed methods is strictly faithful, since IF⁺⁺ uses an approximated inverse Hessian and TracIn⁺⁺ is a δ away from being strictly faithful. To obtain ground truth, TracIn⁺⁺ is modified to use a single checkpoint {circumflex over (θ)} as the “ultimately faithful” explanation method. Then, for each test instance, an explanation list is obtained and its Spearman Correlation with the list obtained from the ground truth is computed. The higher the correlation, the more faithful the method is.

It was discovered that TracIn⁺⁺ has a similar mean as IF⁺⁺ but has a much lower variance, showing its stability over IF⁺⁺. See table 1000 in FIG. 10 which illustrates a comparison of Spearman Correlation with Ground truth. The experiment is run 5 times each. The “Control” is only different from TracIn⁺⁺ on the models used, i.e., “Control” uses three checkpoints of the latest epochs, but TracIn⁺⁺ uses three δ-faithful model variants. Both methods are more faithful to Ground truth than Control that uses checkpoints, showing that the model “ensemble” around {circumflex over (θ)} may be a better choice than “checkpoint averaging” for model explanations. Further explorations may be needed since there are many variables in this comparison.
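By way of a non-limiting illustration only, the Spearman Correlation between the scores of an explanation method and those of the ground-truth method for one test instance can be computed as sketched below. The score values and function names are hypothetical, and tied scores are assigned average ranks.

```python
import math


def ranks(values):
    """1-based ranks of values, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            r[idx] = avg
        pos = end + 1
    return r


def spearman(a, b):
    """Pearson correlation of the ranks of a and b."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0.0 or vb == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(va * vb)


# Scores assigned to the same four training instances by two methods (toy values).
ground_truth = [0.9, 0.5, 0.1, 0.3]
method = [0.8, 0.6, 0.2, 0.1]
rho = spearman(method, ground_truth)
```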

A description of the span extraction details is now provided. A pre-trained reading comprehension (RC) model of Wolf was applied. The templates used for the RC model are shown in FIG. 11 . The following heuristics are used when the RC model fails: 1) the RC model is considered to fail when no span is extracted, or the entire text is returned as an answer. 2) The location of the term in the text is identified and the scope is expanded from that location both to the left and to the right; when a sentence boundary is found, the process is stopped and the resulting scope is returned as the span for the term. It is notable that cases can arise where the words around a term do not necessarily talk about the term. However, such a case is extremely rare.
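By way of a non-limiting illustration only, the fallback heuristics can be sketched as follows. The sketch assumes pre-tokenized text, treats ".", "!" and "?" as sentence boundaries, and represents the RC model output by a plain string (or `None` when no span is extracted); these choices and all function names are assumptions of the sketch.

```python
SENTENCE_END = {".", "!", "?"}  # assumed sentence-boundary tokens


def rc_failed(answer, text):
    """The RC model fails if it returns nothing or the entire text."""
    return answer is None or answer.strip() == "" or answer.strip() == text.strip()


def heuristic_span(tokens, term_index):
    """Expand left and right from the term up to the closest sentence boundary."""
    left = term_index
    while left > 0 and tokens[left - 1] not in SENTENCE_END:
        left -= 1
    right = term_index
    while right + 1 < len(tokens) and tokens[right + 1] not in SENTENCE_END:
        right += 1
    return tokens[left:right + 1]


def extract_span(tokens, term, rc_answer):
    """Use the RC answer when available, otherwise fall back to the heuristic."""
    text = " ".join(tokens)
    if rc_failed(rc_answer, text):
        return heuristic_span(tokens, tokens.index(term))
    return rc_answer.split()


tokens = "The food was great . But the service was slow .".split()
span = extract_span(tokens, "service", None)  # RC model returned no span
```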

The following is a derivation of IF⁺⁺:

$\begin{aligned} \mathcal{I}_{pert,loss}\left(x_{ij}, z_{-kl}; \hat{\theta}\right) &:= \nabla_{\epsilon}\,\mathrm{imp}\left(x_{ij} \mid x; \hat{\theta}_{\epsilon, z_{-kl}, z}\right)\Big|_{\epsilon = 0} \\ &= \frac{d\,\mathrm{imp}\left(x_{ij} \mid x; \hat{\theta}\right)}{d\hat{\theta}} \left(\frac{d\hat{\theta}_{\epsilon, z_{-kl}, z}}{d\epsilon}\Big|_{\epsilon = 0}\right) \\ &= \left(\nabla_{\theta} O_{y}\left(x, \hat{\theta}\right) - \nabla_{\theta} O_{y}\left(x_{-ij}, \hat{\theta}\right)\right) \left(\frac{d\hat{\theta}_{\epsilon, z_{-kl}, z}}{d\epsilon}\Big|_{\epsilon = 0}\right) \\ &= -\left(\nabla_{\theta} O_{y}\left(x, \hat{\theta}\right) - \nabla_{\theta} O_{y}\left(x_{-ij}, \hat{\theta}\right)\right) H_{\hat{\theta}}^{-1} \left(\nabla_{\theta} L\left(z_{-kl}, \hat{\theta}\right) - \nabla_{\theta} L\left(z, \hat{\theta}\right)\right) \end{aligned}$

The following is a derivation of TracIn⁺ and TracIn⁺⁺. Similar to IF (see Koh and Liang) and TracIn (see Pruthi), one begins with the Taylor expansion on point {circumflex over (θ)}_(t) around z′ and z′_(−ij) as:

$\mathcal{L}\left(\hat{\theta}_{t+1}, z'\right) \sim \mathcal{L}\left(\hat{\theta}_{t}, z'\right) + \nabla\mathcal{L}\left(\hat{\theta}_{t}, z'\right)\left(\hat{\theta}_{t+1} - \hat{\theta}_{t}\right)$

$\mathcal{L}\left(\hat{\theta}_{t+1}, z'_{-ij}\right) \sim \mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right) + \nabla\mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right)\left(\hat{\theta}_{t+1} - \hat{\theta}_{t}\right)$

If Stochastic Gradient Descent (SGD) is assumed for optimization for simplicity, ({circumflex over (θ)}_(t+1)−{circumflex over (θ)}_(t))=λ∇ℒ({circumflex over (θ)}_(t),z). Thus, putting it in the above equations and performing subtraction provides,

$\mathcal{L}\left(\hat{\theta}_{t+1}, z'\right) - \mathcal{L}\left(\hat{\theta}_{t+1}, z'_{-ij}\right) \sim \mathcal{L}\left(\hat{\theta}_{t}, z'\right) - \mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right) + \left[\nabla\mathcal{L}\left(\hat{\theta}_{t}, z'\right) - \nabla\mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right)\right]\lambda\nabla\mathcal{L}\left(\hat{\theta}_{t}, z\right)$

And,

$\mathrm{imp}\left(x'_{ij} \mid z'; \hat{\theta}_{t+1}\right) - \mathrm{imp}\left(x'_{ij} \mid z'; \hat{\theta}_{t}\right) \sim \left[\nabla\mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right) - \nabla\mathcal{L}\left(\hat{\theta}_{t}, z'\right)\right]\lambda\nabla\mathcal{L}\left(\hat{\theta}_{t}, z\right).$

So, the left term is the change of importance caused by the parameter change. It can be interpreted as the change of the importance score of span x′_(ij) w.r.t. the parameters of the network. Then, integrating over all of the contributions from different points in the training process provides:

${{TracIn}^{+}\left( {{x_{ij}^{\prime}❘z^{\prime}},z} \right)} = {\sum\limits_{t}{\left\lbrack {{{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta}}_{t},z_{- {ij}}^{\prime}} \right)} - {{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta}}_{t},z^{\prime}} \right)}} \right\rbrack\lambda\bigtriangledown{{\mathcal{L}\left( {{\hat{\theta}}_{t},z} \right)}.}}}$

The above formulation is very similar to TracIn, where a single training instance z is evaluated as a whole. Of interest, however, is the case where a meaningful unit x_(kl) in z can be evaluated for influence. Thus, the same logic of the above equation is applied to z_(−kl), the perturbed training instance where tokens k through l are masked, as:

${{TracIn}^{+}\left( {{x_{ij}^{\prime}❘z^{\prime}},z_{- {kl}}} \right)} = {\sum\limits_{t}{\left\lbrack {{{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta}}_{t},z_{- {ij}}^{\prime}} \right)} - {{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta}}_{t},z^{\prime}} \right)}} \right\rbrack\lambda{{{\bigtriangledown\mathcal{L}}\left( {{\hat{\theta}}_{t},z_{- {kl}}} \right)}.}}}$

Then, the difference TracIn⁺(x′_(ij)|z′,z)−TracIn⁺(x′_(ij)|z′,z_(−kl)) can indicate how much impact a training span x_(kl) has on test span x′_(ij). Formally, the influence of x_(kl) on x_(ij)′ is

$\mathrm{TracIn}^{++}\left(x'_{ij}, x_{kl} \mid z', z\right) = \lambda\sum\limits_{t} \left[\nabla\mathcal{L}\left(\hat{\theta}_{t}, z'_{-ij}\right) - \nabla\mathcal{L}\left(\hat{\theta}_{t}, z'\right)\right] \left[\nabla\mathcal{L}\left(\hat{\theta}_{t}, z\right) - \nabla\mathcal{L}\left(\hat{\theta}_{t}, z_{-kl}\right)\right].$

Such a form is very easy to implement since each term in the summation requires only four (4) gradient estimates.

As will be described below, one or more elements of the present techniques can optionally be provided as a service in a cloud environment. For instance, one or more steps of methodology 200 of FIG. 2 , one or more steps of methodology 300 of FIG. 3 , one or more steps of methodology 400 of FIG. 4 and/or one or more steps of methodology 500 of FIG. 5 can be performed on a dedicated cloud server to take advantage of high-powered CPUs and GPUs.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Turning now to FIG. 12 , a block diagram is shown of an apparatus 1200 for implementing one or more of the methodologies presented herein. By way of example only, apparatus 1200 can be configured to implement one or more steps of methodology 200 of FIG. 2 , one or more steps of methodology 300 of FIG. 3 , one or more steps of methodology 400 of FIG. 4 and/or one or more steps of methodology 500 of FIG. 5 .

Apparatus 1200 includes a computer system 1210 and removable media 1250. Computer system 1210 includes a processor device 1220, a network interface 1225, a memory 1230, a media interface 1235 and an optional display 1240. Network interface 1225 allows computer system 1210 to connect to a network, while media interface 1235 allows computer system 1210 to interact with media, such as a hard drive or removable media 1250.

Processor device 1220 can be configured to implement the methods, steps, and functions disclosed herein. The memory 1230 could be distributed or local and the processor device 1220 could be distributed or singular. The memory 1230 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from, or written to, an address in the addressable space accessed by processor device 1220. With this definition, information on a network, accessible through network interface 1225, is still within memory 1230 because the processor device 1220 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor device 1220 generally contains its own addressable memory space. It should also be noted that some or all of computer system 1210 can be incorporated into an application-specific or general-use integrated circuit.

Optional display 1240 is any type of display suitable for interacting with a human user of apparatus 1200. Generally, display 1240 is a computer monitor or other similar display.

Referring to FIG. 13 and FIG. 14, it is to be understood that although this disclosure includes a detailed description on cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.

Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.

Characteristics are as follows:

On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.

Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).

Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).

Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.

Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models are as follows:

Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.

Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models are as follows:

Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.

Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.

Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.

Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.

Referring now to FIG. 13, illustrative cloud computing environment 50 is depicted. As shown, cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 54A, desktop computer 54B, laptop computer 54C, and/or automobile computer system 54N may communicate. Nodes 10 may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment 50 to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices 54A-N shown in FIG. 13 are intended to be illustrative only and that computing nodes 10 and cloud computing environment 50 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).

Referring now to FIG. 14, a set of functional abstraction layers provided by cloud computing environment 50 (FIG. 13) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 14 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:

Hardware and software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. In some embodiments, software components include network application server software 67 and database software 68.

Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71; virtual storage 72; virtual networks 73, including virtual private networks; virtual applications and operating systems 74; and virtual clients 75.

In one example, management layer 80 may provide the functions described below.

Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.

Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; transaction processing 95; and explaining a machine learning model 96.
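By way of a non-limiting illustration only, the masking-based explanation performed by workload 96 might be sketched as follows. The sketch assumes a hypothetical toy bag-of-words logistic classifier with made-up weights (`WEIGHTS`), and assumes that the importance of a span is taken as the change in loss when that span is replaced by a `[MASK]` token. All function names are illustrative and are not part of the disclosure.

```python
import math

# Hypothetical toy sentiment model: a fixed bag-of-words logistic classifier.
# The weights are made up for illustration; "[MASK]" contributes nothing.
WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5, "bad": -2.0}
MASK = "[MASK]"


def prob_positive(tokens):
    """Probability that the toy model assigns to the positive class."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def loss(tokens, label):
    """Negative log-likelihood of the label (1 = positive, 0 = negative)."""
    p = prob_positive(tokens)
    return -math.log(p if label == 1 else 1.0 - p)


def mask_span(tokens, i, j):
    """Return a copy of tokens with positions i..j (inclusive) masked."""
    return tokens[:i] + [MASK] * (j - i + 1) + tokens[j + 1:]


def span_importance(tokens, label, i, j):
    """Assumed importance: loss with the span masked minus the original loss."""
    return loss(mask_span(tokens, i, j), label) - loss(tokens, label)


def decision(tokens):
    """Model decision: 1 if the positive probability is at least 0.5."""
    return 1 if prob_positive(tokens) >= 0.5 else 0


def decision_changes(tokens, i, j):
    """True if masking the span flips the model decision."""
    return decision(mask_span(tokens, i, j)) != decision(tokens)


example = ["the", "plot", "was", "great", "but", "slightly", "boring"]
for idx, token in enumerate(example):
    print(token, round(span_importance(example, 1, idx, idx), 4),
          decision_changes(example, idx, idx))
```

In this toy setting, masking the token "great" both increases the loss and flips the decision, so that span would be reported as significant, whereas masking a token to which the model assigns no weight leaves the loss unchanged.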

Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope of the invention. 

What is claimed is:
 1. A method for explaining a machine learning model {circumflex over (θ)}, the method comprising: training the machine learning model {circumflex over (θ)} with training data D; obtaining a decision of the machine learning model {circumflex over (θ)}; masking one or more datapoints in the training data D; determining whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking is the same as the decision of the machine learning model {circumflex over (θ)} obtained prior to the masking; and using the masking to explain which of the one or more datapoints in the training data D are significant.
 2. The method of claim 1, wherein the one or more datapoints in the training data D that, when masked, change the decision of the machine learning model {circumflex over (θ)} are significant.
 3. The method of claim 1, wherein the machine learning model {circumflex over (θ)} is used for natural language processing.
 4. The method of claim 1, wherein the machine learning model comprises a trained neural network.
 5. The method of claim 1, further comprising: identifying a training span x_(ij) in a training example z=(x, y) from the training data D, wherein x is a training sequence and y is the decision of the machine learning model {circumflex over (θ)}; masking the training span x_(ij) to provide z_(−ij)=(x_(−ij), y) as the masking, wherein z_(−ij) is corresponding training data to the training span x_(ij), and x_(−ij) is a sequence in which training span x_(ij) is masked; and determining an importance of the training span x_(ij) on the training example z=(x, y) using the masking.
 6. The method of claim 5, wherein the importance of the training span x_(ij) is determined using a loss gradient.
 7. The method of claim 6, further comprising: determining the loss gradient as: imp(x_(ij)|z,{circumflex over (θ)})=ℒ(z_(−ij);{circumflex over (θ)})−ℒ(z;{circumflex over (θ)}).
 8. The method of claim 7, further comprising: determining an influence of the training span x_(ij) on the machine learning model {circumflex over (θ)} by scaling imp(x_(ij)|z,{circumflex over (θ)}).
 9. The method of claim 1, further comprising: determining an influence of a training example z on a test example z′.
 10. The method of claim 9, further comprising: determining an importance ∇imp(x′_(kl)|z′;{circumflex over (θ)}) of a test span x′_(kl) on the test example z′; determining an importance ∇imp(x_(ij)|z;{circumflex over (θ)}) of a training span x_(ij) on the training example z; and determining the influence of the training example z on the test example z′ using ∇imp(x′_(kl)|z′;{circumflex over (θ)}) and ∇imp(x_(ij)|z;{circumflex over (θ)}).
 11. The method of claim 10, wherein the influence of the training example z on the test example z′ is determined as ∇imp(x′_(kl)|z′;{circumflex over (θ)})∇imp(x_(ij)|z;{circumflex over (θ)}).
 12. The method of claim 10, further comprising: evaluating whether the training span x_(ij) of the training example z is semantically related to the test span x′_(kl) of the test example z′.
 13. The method of claim 12, further comprising: defining a semantic representation of the training span x_(ij) of training example z; and measuring the similarity of the semantic representation of the training span x_(ij) of training example z to a semantic representation of the test span x′_(kl) of the test example z′.
 14. A method for explaining a machine learning model {circumflex over (θ)}, the method comprising: identifying a training span x_(ij) in a training example z=(x, y) from the training data D used to train the machine learning model {circumflex over (θ)}, wherein x is a training sequence and y is the decision of the machine learning model {circumflex over (θ)}; masking the training span x_(ij) to provide z_(−ij)=(x_(−ij), y), wherein z_(−ij) is corresponding training data to the training span x_(ij), and x_(−ij) is a sequence in which training span x_(ij) is masked; and determining an importance of the training span x_(ij) on the training example z=(x, y) using the masking.
 15. The method of claim 14, further comprising: determining an influence of the training span x_(ij) on the machine learning model {circumflex over (θ)}.
 16. The method of claim 14, further comprising: determining an influence of the training example z=(x, y) on a test example z′.
 17. The method of claim 16, further comprising: determining an importance ∇imp(x′_(kl)|z′;{circumflex over (θ)}) of a test span x′_(kl) on the test example z′; determining an importance ∇imp(x_(ij)|z;{circumflex over (θ)}) of the training span x_(ij) on the training example z=(x, y); and determining an influence of the training example z=(x, y) on the test example z′ using ∇imp(x′_(kl)|z′;{circumflex over (θ)}) and ∇imp(x_(ij)|z;{circumflex over (θ)}).
 18. The method of claim 17, further comprising: evaluating whether the training span x_(ij) of the training example z=(x, y) is semantically related to the test span x′_(kl) of the test example z′.
 19. A non-transitory computer program product for explaining a machine learning model {circumflex over (θ)}, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: train the machine learning model {circumflex over (θ)} with training data D; obtain a decision of the machine learning model {circumflex over (θ)}; mask one or more datapoints in the training data D; determine whether a new decision of the machine learning model {circumflex over (θ)} obtained after the masking is the same as the decision of the machine learning model {circumflex over (θ)} obtained prior to the masking; and use the masking to explain which of the one or more datapoints in the training data D are significant.
 20. The non-transitory computer program product of claim 19, wherein the datapoints in the training data D that, when masked, change the decision of the machine learning model {circumflex over (θ)} are significant. 
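
By way of a non-limiting illustration only, the span-to-span influence recited above, i.e., the combination of ∇imp(x′_(kl)|z′;{circumflex over (θ)}) and ∇imp(x_(ij)|z;{circumflex over (θ)}), might be sketched as follows. The sketch makes several assumptions that are not taken from the claims: the product of the two gradients is read as a dot product over the model parameters, the model is a hypothetical bag-of-words logistic classifier with made-up parameters (`VOCAB`, `THETA`), and the importance is the masked loss minus the original loss, so that its gradient is the difference of the two loss gradients.

```python
import math

VOCAB = ["great", "good", "boring", "bad"]  # hypothetical vocabulary
THETA = [2.0, 1.0, -1.5, -2.0]              # hypothetical trained parameters
MASK = "[MASK]"


def features(tokens):
    """Bag-of-words count vector over VOCAB."""
    return [float(tokens.count(w)) for w in VOCAB]


def loss_grad(tokens, label):
    """Gradient of the logistic loss w.r.t. THETA: (p - y) * features."""
    phi = features(tokens)
    score = sum(t * f for t, f in zip(THETA, phi))
    p = 1.0 / (1.0 + math.exp(-score))
    return [(p - label) * f for f in phi]


def mask_span(tokens, i, j):
    """Return a copy of tokens with positions i..j (inclusive) masked."""
    return tokens[:i] + [MASK] * (j - i + 1) + tokens[j + 1:]


def grad_imp(tokens, label, i, j):
    """Gradient of the assumed importance: grad L(z_-ij) minus grad L(z)."""
    g_masked = loss_grad(mask_span(tokens, i, j), label)
    g_full = loss_grad(tokens, label)
    return [a - b for a, b in zip(g_masked, g_full)]


def span_influence(train, test):
    """Dot product of the training-span and test-span importance gradients.

    Each argument is a (tokens, label, i, j) tuple.
    """
    g_train = grad_imp(*train)
    g_test = grad_imp(*test)
    return sum(a * b for a, b in zip(g_train, g_test))


train = (["a", "great", "film"], 1, 1, 1)        # training span: "great"
test_related = (["great", "acting"], 1, 0, 0)    # test span: "great"
test_unrelated = (["bad", "acting"], 0, 0, 0)    # test span: "bad"
print(span_influence(train, test_related))
print(span_influence(train, test_unrelated))
```

In this toy setting, a training span and a test span that act on the same parameter yield a positive score, while spans that touch disjoint parameters yield a score of zero.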